Consistent Weighted Sampling

Authors

  • Mark Manasse
  • Frank McSherry
  • Kunal Talwar
Abstract

We describe an efficient procedure for sampling representatives from a weighted set such that, for any weightings S and T, the probability that the two choose the same sample is equal to the Jaccard similarity between them:

$$\Pr[\mathrm{sample}(S) = \mathrm{sample}(T)] = \frac{\sum_x \min(S(x), T(x))}{\sum_x \max(S(x), T(x))}$$

where sample(S) is a pair (x, y) with 0 < y ≤ S(x). The sampling process takes expected computation linear in the number of non-zero weights in S, independent of the weights themselves. Sampling computations of this form are commonly limited mainly by the required (pseudo) randomness, which must be carefully maintained and reproduced to provide the consistency properties. Whereas previous approaches require randomness dependent on the sizes of the weights, we use an expected number of bits per weight independent of the values of the weights themselves. Furthermore, we discuss and develop the implementation of our sampling schemes, reducing the requisite computation and randomness substantially in practice.
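To make the stated property concrete, here is a minimal Python sketch. It is not the procedure from the paper: it assumes integer weights and uses the naive approach of expanding each element x into the pairs (x, 1), ..., (x, S(x)) and taking the pair with the smallest seeded hash, so its cost and randomness grow with the sizes of the weights, which is the dependence the abstract says the paper avoids. The function names (`weighted_jaccard`, `naive_sample`, `estimate_similarity`) and the use of SHA-256 as the shared source of pseudo-randomness are assumptions made for illustration.

```python
import hashlib


def weighted_jaccard(S, T):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(S) | set(T)
    num = sum(min(S.get(x, 0), T.get(x, 0)) for x in keys)
    den = sum(max(S.get(x, 0), T.get(x, 0)) for x in keys)
    return num / den if den else 0.0


def _hash(seed, x, y):
    # Deterministic hash shared by all weighted sets (illustrative choice).
    data = repr((seed, x, y)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def naive_sample(S, seed):
    """Return the pair (x, y), 0 < y <= S(x), with the smallest hash.

    Assumes non-negative integer weights; runs in time proportional to
    the total weight, unlike the scheme described in the abstract.
    """
    best = None
    for x, w in S.items():
        for y in range(1, w + 1):
            h = _hash(seed, x, y)
            if best is None or h < best[0]:
                best = (h, x, y)
    return None if best is None else (best[1], best[2])


def estimate_similarity(S, T, trials=4000):
    """Fraction of seeds on which S and T choose the same sample."""
    hits = sum(naive_sample(S, s) == naive_sample(T, s) for s in range(trials))
    return hits / trials
```

For example, with S = {"a": 2, "b": 1} and T = {"a": 1, "b": 3}, the minima sum to 2 and the maxima sum to 5, so `weighted_jaccard(S, T)` is 0.4, and `estimate_similarity(S, T)` should land close to that value because the two samples agree exactly when the minimum-hash pair of the union lies in the intersection.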

Similar resources

Median Estimation in Sample Surveys

In a recent paper Maritz and Jarrett (1978) proposed a small-sample estimate of the variance of sample medians from a continuous population. In this paper their methods are adapted to median estimation in stratified sampling without replacement from finite populations. A weighted sample median for estimating the median of heavy-tailed or skewed populations is proposed. Its asymptotic normal distri...

Weighted Likelihood for Semiparametric Models and Two-phase Stratified Samples, with Application to Cox Regression

Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted (IPW) versions of the likelihood equations, offers a simple and robust method for fitting models to two-phase stratified samples. We consider semiparametric models for which solution of infinite dimensional estimating equations leads to $\sqrt{N}$-consistent and asymptotically Gaussian estimators of both Euclidea...

Weighted Empirical Likelihood in Some Two-sample Semiparametric Models with Various Types of Censored Data

In this article, the weighted empirical likelihood is applied to a general setting of two-sample semiparametric models, which includes biased sampling models and case-control logistic regression models as special cases. For various types of censored data, such as right censored data, doubly censored data, interval censored data and partly interval-censored data, the weighted empirical likelihoo...

Consistent Weighted Sampling Made More Practical

Min-Hash, which is widely used for efficiently estimating similarities of data represented as bags of words, plays an increasingly important role in the era of big data. It has been extended to deal with real-valued weighted sets; Improved Consistent Weighted Sampling (ICWS) is considered the state of the art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first ...

Improved Consistent Weighted Sampling Revisited

Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered the state-of-the-art CWS algorithm. In ...

Publication date: 2006